VAST Challenge 2021 MC2 (I)

R Visualization Interactive Charts

To visualize & analyze card usage and car movement data with the employee disappearance incident

LIU Yangguang https://www.linkedin.com/in/ygliu/ (School of Computing and Information Systems, Singapore Management)
07-23-2021

Background

This study is based on the Mini-Challenge 2 of the VAST Challenge 2021. In a fiction scenario, there is a natural gas company named “GASTech” operating in the island country if Kronos. The GASTech didn’t do well in environment stewardship. And after an company IPO celebration in January 2014, several employees of GASTech went missing. An environment organization is suspected in the disappearance.

Many of the Abila, Kronos-based employees of GAStech have company cars which are approved for both personal and business use. Those who do not have company cars have the ability to check out company trucks for business use, but these trucks cannot be used for personal business.

Employees with company cars are happy to have these vehicles, because the company cars are generally much higher quality than the cars they would be able to afford otherwise. However, GAStech does not trust their employees. Without the employees? knowledge, GAStech has installed geospatial tracking software in the company vehicles. The vehicles are tracked periodically as long as they are moving.

This vehicle tracking data has been made available to law enforcement to support their investigation. Unfortunately, data is not available for the day the GAStech employees went missing. Data is only available for the two weeks prior to the disappearance.

To promote local businesses, Kronos based companies provide a Kronos Kares benefit card to GASTech employees giving them discounts and rewards in exchange for collecting information about their credit card purchases and preferences as recorded on loyalty cards. This data has been made available to investigators in the hopes that it can help resolve the situation. However, Kronos Kares does not collect personal information beyond purchases.

Requirement

Use visual analytics to identify which GASTech employees made which purchases and identify suspicious patterns of behavior. Besides, the study must cope with uncertainties that result from missing, conflicting, and imperfect data to make recommendations for further investigation.

Questions

  1. Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies? Please limit your answer to 8 images and 300 words.

  2. Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.

  3. Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.

  4. Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.

  5. Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why Please limit your response to 10 images and 500 words.

Literature review

The VAST Challenge 2014 has the same scenario with slightly different dataset and questions. The submission repository can be found here.

Various analytic tools were used among the submissions, like JMP, D3 and custom tools. The heatmap and time histograms were useful to represent the numerical value under the combination of one categorical variable and one discrete/categorical variable, such as the usage frequency under different locations and days. Besides, movement line graph with the map background can help to identify and check suspicious activities.

However, almost all graphs were static and readers would find it difficult to explore other parts in graphs which were not specially mentioned by authors. Since the study is displayed on html page, the interactive graphs will be possible. For example, the tooltip function can make every data point to have detailed information without checking the axis or drawing additional graphs. The zoom-in and onclick functions allow readers to check the whole complex graph with too many lines/objects and focus on one part only.

Data Preparation

Data Wrangling

Import packages.

Card Usage data

The location names contain some special characters, such as “Café”, which are not recognized by utf-8 encoding. Thus, special encoding is used in reading data.

loyalty <- read_csv("data/loyalty_data.csv", locale=locale(encoding ="windows-1252"))
cc <- read_csv("data/cc_data.csv", locale=locale(encoding ="windows-1252"))

Take a glimpse of credit card data and loyalty card data

knitr::kable(cc[c(0:5),],
             caption = "Credit Card  Usage Data") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 1: Credit Card Usage Data
timestamp location price last4ccnum
01/06/2014 07:28 Brew’ve Been Served 11.34 4795
01/06/2014 07:34 Hallowed Grounds 52.22 7108
01/06/2014 07:35 Brew’ve Been Served 8.33 6816
01/06/2014 07:36 Hallowed Grounds 16.72 9617
01/06/2014 07:37 Brew’ve Been Served 4.24 7384
knitr::kable(loyalty[c(0:5),],
             caption = "Loyalty Card  Usage Data") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 2: Loyalty Card Usage Data
timestamp location price loyaltynum
01/06/2014 Brew’ve Been Served 4.17 L2247
01/06/2014 Brew’ve Been Served 9.60 L9406
01/06/2014 Hallowed Grounds 16.53 L8328
01/06/2014 Coffee Shack 11.51 L6417
01/06/2014 Hallowed Grounds 12.93 L1107

The timestamp in the credit card usage date (“cc”) contains date and time, while the timestamp in the loyal card usage data (“loyalty”) contains only data. Besides, their data type is string, which will be transformed into datetime type.

And we separate day, hour from the datetime feature.

loyalty$timestamp <- as.Date(loyalty$timestamp, "%m/%d/%Y")
cc$timestamp <- strptime(cc$timestamp, "%m/%d/%Y %H:%M")

loyalty$day <- mday(loyalty$timestamp)
cc$date <- as.Date(cc$timestamp, "%m/%d/%Y %H:%M")
cc$day <- mday(cc$date)
cc$hour <- hour(cc$timestamp)
GPS and car assignments
gps <- read_csv("data/gps.csv")

knitr::kable(gps[c(0:5),],
             caption = "GPS Data") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 3: GPS Data
Timestamp id lat long
01/06/2014 06:28:01 35 36.07623 24.87469
01/06/2014 06:28:01 35 36.07622 24.87460
01/06/2014 06:28:03 35 36.07621 24.87444
01/06/2014 06:28:05 35 36.07622 24.87425
01/06/2014 06:28:06 35 36.07621 24.87417

The timestamp in the GPS data also need to be transformed.

And the longitude and latitude are rounded into 5 digits. It can avoid the inconsistent/inaccurate GPS data to some extent. And five decimal places implies 1.11 meters accuracy, which is better than 4 or 6 digits (11.1 meter or 0.11 meter accuracy) under this question scenario.

# transform features
gps$Timestamp <- strptime(gps$Timestamp, "%m/%d/%Y %H:%M:%S")
gps$day <- mday(gps$Timestamp)
### round the gps into 5 digits
gps$lat <- round(gps$lat, digits = 5)
gps$long <- round(gps$long, digits = 5)
# use individual gps2 to find stop locations
gps2 <- gps

In the challenge page, it mentioned that the vehicles are tracked periodically as long as they are moving. Thus, the time gap in the GPS data within one car indicates that this car stopped at current GPS location. Stops correspond to local business locations or other locations. To find these business locations, we excluded out the time gap less than 3 minutes, which might be that the car stopped to wait for traffic light.

gps2 <- gps2 %>% 
  group_by(id) %>% 
  mutate(end = Timestamp,
         start = lag(Timestamp, default = first(Timestamp),
                   order_by = Timestamp),
         diff_mins = difftime(end, start, units = "mins")) %>% 
  mutate(stop = ifelse(diff_mins >= 3, TRUE, FALSE)) %>% 
  filter(stop == TRUE) %>% 
  ungroup()
# rearrange useful features
gps2_stop <- gps2[c(7,6,2,3,4,8,5)]

gps2_stop_sf <- st_as_sf(gps2_stop,
                         coords = c("long", "lat"), # combine the lo, la
                         crs = 4326) # 4326 is wgs84 Geographic Coordinate System

The “start” in the “gps2_stop_sf” refers to the time when the car starts parking, while the “end” refers to the time when the car ends parking.

Besides, most vehicles are assigned one-to-one. Only truck drivers are not assigned cars but are allowed to use available truck for business

car_assignments <- read_csv("data/car-assignments.csv")
# check car assignment data
knitr::kable(car_assignments,
             caption = "Car assignment") %>% 
  kableExtra::kable_paper("hover", full_width = F) %>% 
  kableExtra::scroll_box(height = "300px")
Table 4: Car assignment
LastName FirstName CarID CurrentEmploymentType CurrentEmploymentTitle
Calixto Nils 1 Information Technology IT Helpdesk
Azada Lars 2 Engineering Engineer
Balas Felix 3 Engineering Engineer
Barranco Ingrid 4 Executive SVP/CFO
Baza Isak 5 Information Technology IT Technician
Bergen Linnea 6 Information Technology IT Group Manager
Orilla Elsa 7 Engineering Drill Technician
Alcazar Lucas 8 Information Technology IT Technician
Cazar Gustav 9 Engineering Drill Technician
Campo-Corrente Ada 10 Executive SVP/CIO
Calzas Axel 11 Engineering Hydraulic Technician
Cocinaro Hideki 12 Security Site Control
Ferro Inga 13 Security Site Control
Dedos Lidelse 14 Engineering Engineering Group Manager
Bodrogi Loreto 15 Security Site Control
Vann Isia 16 Security Perimeter Control
Flecha Sven 17 Information Technology IT Technician
Frente Birgitta 18 Engineering Geologist
Frente Vira 19 Engineering Hydraulic Technician
Fusil Stenig 20 Security Building Control
Osvaldo Hennie 21 Security Perimeter Control
Nubarron Adra 22 Security Badging Office
Lagos Varja 23 Security Badging Office
Mies Minke 24 Security Perimeter Control
Herrero Kanon 25 Engineering Geologist
Onda Marin 26 Engineering Drill Site Manager
Orilla Kare 27 Engineering Drill Technician
Borrasca Isande 28 Engineering Drill Technician
Ovan Bertrand 29 Facilities Facilities Group Manager
Resumir Felix 30 Security Security Group Manager
Sanjorge Jr.  Sten 31 Executive President/CEO
Strum Orhan 32 Executive SVP/COO
Tempestad Brand 33 Engineering Drill Technician
Vann Edvard 34 Security Perimeter Control
Vasco-Pais Willem 35 Executive Environmental Safety Advisor
Hafon Albina NA Facilities Truck Driver
Hawelon Benito NA Facilities Truck Driver
Hawelon Claudio NA Facilities Truck Driver
Mies Henk NA Facilities Truck Driver
Morlun Valeria NA Facilities Truck Driver
Morlun Adan NA Facilities Truck Driver
Morluniau Cecilia NA Facilities Truck Driver
Nant Irene NA Facilities Truck Driver
Scozzese Dylan NA Facilities Truck Driver
gps2_stop_sf <- left_join(gps2_stop_sf, 
                          car_assignments, by = c("id" = "CarID"))

Lastly, we also need to draw the car movement path on the map. It requires the GPS data to be coordinate formats and one path is actually one line string with multiple GPS points.

# convert values from numerical to factor data type
gps$day <- as.factor(gps$day)
gps$id <- as_factor(gps$id)

gps_sf <- st_as_sf(gps,
                   coords = c("long", "lat"),
                   crs = 4326)
# group car paths
gps_path <- gps_sf %>%
  group_by(id, day) %>%
  summarize(m =mean(Timestamp),
            do_union=FALSE) %>%
  st_cast("LINESTRING")

QGIS

The tourist map provided is not georeferenced. And QGIS can help to georeference an image with the ESRI shapefiles (geospatial vector data) of the city.

The process includes:

  1. load JPG tourist map and shp road map
  2. create several referencing points between two maps
  3. start georeferencing maps and check the correspondence

After the process, we will get a tif file which is a combination of tourist map and georeferenced road map. Then we can plot car movements line with longitude and latitude data on the map.

we need to import the tif file generated by QGIS and display the map.

bgmap <- raster("data/Geospatial/MC2-tourist.tif")

Visualization and Insights

Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

To identify popularity, we can calculate the card usage frequency and amount in every locations of different days and hours.

Firstly, let’s plot the frequency of cards in the 14 days. We need to calculate the card usage frequency in different days, convert into data frame, draw their heatmaps and plot together.

Q1-Fig1 Code
# calculate the frequency data frame of credit and loyalty card usage
cc_freq_day <- as.data.frame(xtabs(~location+day, data = cc))
loyalty_freq_day <- as.data.frame(xtabs(~location+day, data = loyalty))

# join the two frequency data frame
freq_day_join <- full_join(cc_freq_day,loyalty_freq_day,by= c("location","day"))
names(freq_day_join) <- c("location","day","CC_Freq","Loyalty_Freq")
# transfer from factors to numeric with original values
freq_day_join$day <- as.numeric(levels(freq_day_join$day))[freq_day_join$day]
# plot the heatmap of credit card usage frequency 
p1 <- ggplot(freq_day_join,aes(x=day,y=location))+
  geom_tile(aes(fill=CC_Freq))+
  scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.title=element_blank())
# plot the heatmap of loyalty card usage frequency 
p2 <- ggplot(freq_day_join,aes(x=day,y=location))+
  geom_tile(aes(fill=Loyalty_Freq))+
  scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.title=element_blank())
# convert static graph into interactive
plotly::subplot(ggplotly(p1),
                ggplotly(p2),
                shareY = TRUE)

Figure 1: Daily Frequency of Credit (left) and Loyalty (right) Card Usage

From the card usage frequency (or consumption frequency), we can easily identify that “Katerina’s Café”, “Hippokampos” and “Brew’ve Been Served” are the most popular with almost all squares in deeper color, where the daily consumption frequency is above 10. “Hallowed Grounds” and “Guy’s Gyros” are slightly less popular.

Besides, we can find that “Brew’ve Been Served” and “Hallowed Grounds” are popular every day except weekends (day 11-12, 18-19). The frequency are 0 on weekends, which might because the location is closed on weekends. It’s the same to “Hallowed Grounds”.

On weekends, “Katerina’s Café” and “Hippokampos” are the most popular while other locations might be closed or less consumption these days.

As for anomalies, we can see there is one white line in the graph for loyalty card, corresponding to “Daily Dealz”. This location only have one credit card consumption record on day 13 and no loyalty card record among the two weeks.

The daily frequencies are the same between “Maximum Iron and Steel” and “Kronos Pipe and Irrigation” every day in the two weeks.

To correct these anomalies, we can check the GPS data to make sure who made the only one consumption in “Daily Dealz”. If there were no anomalies after checking, we can just delete this single record in the credit card data. And for the situation between “Maximum Iron and Steel” and “Kronos Pipe and Irrigation”, it’s just coincidence after checking the consumption amount.

Secondly, we can plot the consumption amount instead of frequency. The steps are almost the same.

Q1-Fig2 Code
cc_price_matrix <- tapply(cc$price,cc[,c("location","day")],sum)
cc_price <- reshape2::melt(cc_price_matrix)
cc_price <- na.omit(cc_price)
p1_price <- ggplot(cc_price,aes(x=day,y=location))+
  geom_tile(aes(fill=value))+
  scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.title=element_blank())

ggplotly(p1_price)

Figure 2: Daily Consumption Amount of Credit Card

Q1-Fig3 Code
plot_ly(cc_price, x = ~value, y = ~location, type = "box",
        boxpoints = "outliers", marker = list(color= 'rgb(255,0,0)')) %>% 
  layout(showlegend = FALSE)

Figure 3: Daily Consumption Amount of Credit Card (box plot)

The consumption amount differences among locations are much bigger than frequency differences.

Apparently, “Abila Airport” are the place where has the biggest consumption amount. And these consumption occurred on weekdays only.

Besides, “Stewart and Sons Fabrication”, “Nationwide Refinery” and “Abila Airport” also have high consumption amounts. All these locations don’t show high frequency values in previous graphs but have very high daily consumption amounts.

And there are many outliers which might be anomalies. For example, “Frydos Autosupply n’ More” had a daily cc consumption amount ($10455.22) on day 13, which is several times as much as those in other days. And the “Albert’s Fine Clothing” also has a daily consumption outlier on day 17.

What’s more, there are many inconsistencies between amounts in the credit card record and loyalty card record. At “Stewart and Sons Fabrication”, the daily amounts from day 13 to day 16 don’t match in two graphs.

To correct these anomalies, we need to check through the car movement data where the consumption amount outliers exist. It’s to see whether there are activities or other gathering to cause the high consumption. As for the inconsistency in amounts, the possible explanations are there might be someone used only one of the two cards or got cashback in the consumption.

Lastly, we change the time unit from days to hours to analyze the popular locations. Only the timestamp of credit card data contains time, so there are no hourly heatmaps for loyalty card usage.

Q1-Fig4
cc_freq_hour <- as.data.frame(xtabs(~location+hour, data = cc))
# convert factor into number
cc_freq_hour$hour <- as.numeric(levels(cc_freq_hour$hour))[cc_freq_hour$hour]

cc_price_hour_matrix <- tapply(cc$price,cc[,c("location","hour")],sum)
cc_price_hour <- reshape2::melt(cc_price_hour_matrix)

cc_hour_join <- full_join(cc_freq_hour, cc_price_hour, by= c("location","hour"))
names(cc_hour_join) <- c("location","hour","Freq","Amount")

p3_freq <- ggplot(cc_hour_join,aes(x=hour,y=location))+
  geom_tile(aes(fill=Freq))+
  scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

p3_price <- ggplot(cc_hour_join,aes(x=hour,y=location))+
  geom_tile(aes(fill=Amount))+
  scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

plotly::subplot(ggplotly(p3_freq),
        ggplotly(p3_price),
        shareY = TRUE) %>% 
  hide_colorbar()

Figure 4: Hourly Consumption Frequency and Amount of Credit Card

From the left hourly heatmap, we can easily identify the popular period for each locations since there are clear pattern.

And some anomalies exist in the strange time period. At 3am, there are 5 credit card usages in “Kronos Mart”. For “Daily Dealz”, the only credit card transaction happened at 6am.

The right heatmap also shows anomalies: the consumption at “Bean There Done That” and “Coffee Shack” all happened at 12 o’clock.

knitr::kable(cc %>% 
               filter(location == "Bean There Done That" | location == "Coffee Shack" ),
             caption = "Consumption record at Bean There Done That and Coffee Shack ") %>% 
  kableExtra::kable_paper("hover", full_width = F) %>% 
  kableExtra::scroll_box(height = "300px") 
Table 5: Consumption record at Bean There Done That and Coffee Shack
timestamp location price last4ccnum date day hour
2014-01-06 12:00:00 Coffee Shack 51.51 7117 2014-01-06 6 12
2014-01-06 12:00:00 Bean There Done That 10.28 1415 2014-01-06 6 12
2014-01-06 12:00:00 Bean There Done That 16.09 9635 2014-01-06 6 12
2014-01-06 12:00:00 Bean There Done That 8.54 1877 2014-01-06 6 12
2014-01-06 12:00:00 Bean There Done That 5.24 1321 2014-01-06 6 12
2014-01-06 12:00:00 Bean There Done That 4.27 6895 2014-01-06 6 12
2014-01-07 12:00:00 Coffee Shack 16.63 7117 2014-01-07 7 12
2014-01-07 12:00:00 Bean There Done That 51.25 1415 2014-01-07 7 12
2014-01-07 12:00:00 Bean There Done That 53.89 1877 2014-01-07 7 12
2014-01-07 12:00:00 Bean There Done That 8.03 1321 2014-01-07 7 12
2014-01-07 12:00:00 Bean There Done That 46.25 6895 2014-01-07 7 12
2014-01-08 12:00:00 Coffee Shack 13.13 7117 2014-01-08 8 12
2014-01-08 12:00:00 Bean There Done That 17.45 1415 2014-01-08 8 12
2014-01-08 12:00:00 Bean There Done That 12.07 9635 2014-01-08 8 12
2014-01-08 12:00:00 Bean There Done That 8.06 1877 2014-01-08 8 12
2014-01-08 12:00:00 Bean There Done That 94.96 1321 2014-01-08 8 12
2014-01-09 12:00:00 Coffee Shack 5.01 7117 2014-01-09 9 12
2014-01-09 12:00:00 Bean There Done That 13.00 1415 2014-01-09 9 12
2014-01-09 12:00:00 Bean There Done That 4.06 1877 2014-01-09 9 12
2014-01-09 12:00:00 Bean There Done That 15.53 1321 2014-01-09 9 12
2014-01-10 12:00:00 Bean There Done That 15.39 1415 2014-01-10 10 12
2014-01-10 12:00:00 Bean There Done That 3.92 9635 2014-01-10 10 12
2014-01-10 12:00:00 Bean There Done That 13.41 1877 2014-01-10 10 12
2014-01-10 12:00:00 Bean There Done That 18.67 1321 2014-01-10 10 12
2014-01-10 12:00:00 Bean There Done That 9.30 6895 2014-01-10 10 12
2014-01-13 12:00:00 Coffee Shack 4.09 7117 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 7.93 1415 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 98.02 9635 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 19.56 1877 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 5.54 1321 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 10.37 1874 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 15.24 9617 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 18.59 6895 2014-01-13 13 12
2014-01-14 12:00:00 Coffee Shack 3.63 7117 2014-01-14 14 12
2014-01-14 12:00:00 Bean There Done That 7.28 1415 2014-01-14 14 12
2014-01-14 12:00:00 Bean There Done That 19.37 9635 2014-01-14 14 12
2014-01-14 12:00:00 Bean There Done That 14.56 1877 2014-01-14 14 12
2014-01-14 12:00:00 Bean There Done That 67.82 1321 2014-01-14 14 12
2014-01-14 12:00:00 Bean There Done That 5.90 6895 2014-01-14 14 12
2014-01-15 12:00:00 Bean There Done That 5.34 9635 2014-01-15 15 12
2014-01-15 12:00:00 Bean There Done That 93.13 1877 2014-01-15 15 12
2014-01-15 12:00:00 Bean There Done That 18.96 1321 2014-01-15 15 12
2014-01-15 12:00:00 Bean There Done That 15.24 6895 2014-01-15 15 12
2014-01-16 12:00:00 Coffee Shack 10.48 7117 2014-01-16 16 12
2014-01-16 12:00:00 Bean There Done That 19.82 1415 2014-01-16 16 12
2014-01-16 12:00:00 Bean There Done That 12.33 9635 2014-01-16 16 12
2014-01-16 12:00:00 Bean There Done That 5.56 1877 2014-01-16 16 12
2014-01-16 12:00:00 Bean There Done That 98.34 1321 2014-01-16 16 12
2014-01-16 12:00:00 Bean There Done That 17.40 6895 2014-01-16 16 12
2014-01-17 12:00:00 Coffee Shack 19.16 7117 2014-01-17 17 12
2014-01-17 12:00:00 Bean There Done That 19.67 1415 2014-01-17 17 12
2014-01-17 12:00:00 Bean There Done That 8.21 9635 2014-01-17 17 12
2014-01-17 12:00:00 Bean There Done That 19.56 1877 2014-01-17 17 12
2014-01-17 12:00:00 Bean There Done That 18.72 1321 2014-01-17 17 12
2014-01-17 12:00:00 Bean There Done That 5.77 6895 2014-01-17 17 12

We can see that all consumption records are at 12:00:00. There might be something wrong with the machine which records credit card consumption at the two location.

Q2: GPS Data and Anomalies

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

Anomaly 1: High consumption at “Frydos Autosupply n’ More” on day 13, Figure 2

Firstly, filter out the credit card consumption record at “Frydos Autosupply n’ More” on day 13.

knitr::kable(cc %>% 
               filter(day == 13 & location == "Frydos Autosupply n' More"),
             caption = "Consumption record at Frydos Autosupply n' More on day 13") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 6: Consumption record at Frydos Autosupply n’ More on day 13
timestamp location price last4ccnum date day hour
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000.00 9551 2014-01-13 13 19
2014-01-13 19:41:00 Frydos Autosupply n’ More 188.57 8129 2014-01-13 13 19
2014-01-13 19:59:00 Frydos Autosupply n’ More 64.60 8411 2014-01-13 13 19
2014-01-13 21:11:00 Frydos Autosupply n’ More 202.05 2418 2014-01-13 13 21

The abnormal consumption is from the cc number 9551. Let’s check the consumption records of this cc on day 13.

knitr::kable(cc %>% 
               filter(day == 13 & last4ccnum == 9551),
             caption = "Consumption record of cc 9551 owner on day 13") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 7: Consumption record of cc 9551 owner on day 13
timestamp location price last4ccnum date day hour
2014-01-13 06:04:00 Daily Dealz 2.01 9551 2014-01-13 13 6
2014-01-13 13:18:00 U-Pump 55.25 9551 2014-01-13 13 13
2014-01-13 13:28:00 Hippokampos 30.51 9551 2014-01-13 13 13
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000.00 9551 2014-01-13 13 19
2014-01-13 19:30:00 Ouzeri Elian 28.75 9551 2014-01-13 13 19

We can see the cc owner make the only only transaction at “Daily Dealz” at early morning (6 o’clock), which is the only one transaction in the two weeks.

Besides, “U-Pump” is a special place because there were only two consumption records in the two weeks, which can be found in Figure 4 and Figure 1. Therefore, there should have fewer stop locations near U-Pump in the car GPS data.

We can check the stop locations on day 13. On this day, there was one point near U-Pump where the stop time is near the consumption time in “U-Pump”. The corresponding car id is 24.

Thus, we think that Minke, the owner of car 24, use the credit card 9551. Let’s draw the moving path of this car to discover more. All elements are draw in the same plot to enhance understanding

Q2-Fig1 Code
gps2_stop_day13 <- gps2_stop_sf %>% 
  filter(day ==13)

gps2_stop_car24_day13 <- gps2_stop_sf %>% 
  filter(day ==13 & id == 24)

gps_path_car24_day13 <- gps_path %>% 
  filter(day == 13 & id == 24)

map1 <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_path_car24_day13) +
  tm_lines(col = "blue") +
  tm_shape(gps2_stop_day13) +
  tm_dots() +
  tm_shape(gps2_stop_car24_day13) +
  tm_dots(col = "blue", size = 0.1)
tmap_leaflet(map1)

Figure 5: Stop locations and driving path of car 24 on day 13

Hovering over blue dots, we can see the stop locations of car 24. On day 13, the car started running at about 7 o’clock from home (the east area in the map) and stopped at “Katerina’s Café” (the south-east area) for half an hour. Then, the car stopped near “Albert’s Fine Clothing” at around noon (the north-west area).

After that, the car stopped near “U-Pump” (the center area) from 12:35 to 13:22. The purchase time in “U-Pump”, 13:18:00, matches the time period.

From 13:27 to 17:57, the car stopped at the GASTech company (south area), which could be the employee was working.

After the work, the car stopped near the “Brew’ve Been Served” (the south-east area) from 18:00 to 19:29. The high consumption occured in this period. The stop location is also close to the “Frydos Autosupply n’ More”. So the driver might stopped the car and walked to the “Frydos Autosupply n’ More” to make the consumption.

There are strange things.

  1. The consumption at “Daily Dealz” occurred at 06:04:00, while the car left home at 07:32:01. It’s strange that the purchase happedned so early and the location can’t be found in other records

  2. The consumption at “Hippokampos” occurred at 13:28:00, while the car stopped at the company at 13:27:14. The time gap is about 30 seconds

  3. The consumption at “Ouzeri Elian” occurred at 19:30:00, while the car left the “Frydos Autosupply n’ More” at 19:29:01. The time gap is just 30 seconds after the car left

We can check the consumption from the combination of credit and loyalty cards data. We use left join to find the corresponding records in the loyalty data.

knitr::kable(cc %>% 
               filter(day == 13 & last4ccnum == 9551) %>% 
               left_join(loyalty, by = c("location", "day", "price")),
             caption = "Consumption record of cc 9551 with corresponding loyalty records on day 13") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 8: Consumption record of cc 9551 with corresponding loyalty records on day 13
timestamp.x location price last4ccnum date day hour timestamp.y loyaltynum
2014-01-13 06:04:00 Daily Dealz 2.01 9551 2014-01-13 13 6 NA NA
2014-01-13 13:18:00 U-Pump 55.25 9551 2014-01-13 13 13 NA NA
2014-01-13 13:28:00 Hippokampos 30.51 9551 2014-01-13 13 13 2014-01-13 L5777
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000.00 9551 2014-01-13 13 19 NA NA
2014-01-13 19:30:00 Ouzeri Elian 28.75 9551 2014-01-13 13 19 2014-01-13 L5777

We can see that the two consumption records, which have little time gap with the car leaving/stopping, exactly have corresponding loyalty card usage. But the other three consumption records should be less rush but didn’t use loyalty card. One possible explaination might be the card stealing. This suspicious activity need to be analyzed further in question 5.

Anomaly 2: Mid-night consumption at “Kronos Mart”, Figure 4

The first step is to find the corresponding records.

knitr::kable(cc %>% 
               filter(location == "Kronos Mart"),
             caption = "Consumption at Kronos Mart") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 9: Consumption at Kronos Mart
timestamp location price last4ccnum date day hour
2014-01-10 09:30:00 Kronos Mart 203.91 7688 2014-01-10 10 9
2014-01-12 03:39:00 Kronos Mart 277.26 8156 2014-01-12 12 3
2014-01-13 03:00:00 Kronos Mart 147.30 5407 2014-01-13 13 3
2014-01-13 08:01:00 Kronos Mart 159.06 6816 2014-01-13 13 8
2014-01-14 08:20:00 Kronos Mart 58.85 6899 2014-01-14 14 8
2014-01-16 07:30:00 Kronos Mart 298.83 7108 2014-01-16 16 7
2014-01-17 08:08:00 Kronos Mart 286.24 1415 2014-01-17 17 8
2014-01-19 03:13:00 Kronos Mart 87.66 3484 2014-01-19 19 3
2014-01-19 03:45:00 Kronos Mart 194.51 9551 2014-01-19 19 3
2014-01-19 03:48:00 Kronos Mart 150.36 8332 2014-01-19 19 3

The strange consumption records are the last 3 rows, which occurred at 3 o’clock on day 19 by the owner of credit cards 3484, 9551, 8332.

Coincidentally, credit card 9551 also appeared in the Anomaly 1.

Day 19 is one day before the employee missing incident. We can check the car stop points in the recent one week to find the reason or any anomalies.

Q2-Fig2 Code
gps2_stop_days <- gps2_stop_sf %>%
  filter(between(day,13,18))

gps2_stop_day19 <- gps2_stop_sf %>%
  filter(day == 19)

map2 <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps2_stop_days) +
  tm_dots(size = 0.1, alpha = 0.5) +
  tm_shape(gps2_stop_day19) +
  tm_dots(col = "red", size = 0.1, alpha = 0.5)
tmap_leaflet(map2)

Figure 6: Stop locations from day 13 to day 19

The “Kronos Mart” is located at the west direction with a red symbol. After zooming the map, we can see there were no car stop location near the mart On day 19 (red dot). And several closer red dot, which located at “Roberts and Sons”, were in the afternoon (stop period within 13 to 14 o’clock).

But there are three black dot which are very close to the “Kronos Mart”. Furthermore, The three car stop all started at about 13:30 and ends at about 16:00 on day 18.

The owners of the three cars are listed.

gps2_stop_days %>% 
               filter((id == 1 | id == 10 |id == 23) 
                      & day == 18
                      & start > "2014-01-18 13:00:00"
                      & end < "2014-01-18 16:00:00") %>% 
knitr::kable(caption = "The three car stop near Kronos Mart") %>% 
  kableExtra::kable_paper("hover", full_width=T)
Table 10: The three car stop near Kronos Mart
start end id diff_mins day geometry LastName FirstName CurrentEmploymentType CurrentEmploymentTitle
2014-01-18 13:48:01 2014-01-18 15:14:01 23 86.0 mins 18 POINT (24.8498 36.06586) Lagos Varja Security Badging Office
2014-01-18 13:29:31 2014-01-18 15:52:01 10 142.5 mins 18 POINT (24.84983 36.06588) Campo-Corrente Ada Executive SVP/CIO
2014-01-18 13:36:43 2014-01-18 15:58:01 1 141.3 mins 18 POINT (24.84982 36.06582) Calixto Nils Information Technology IT Helpdesk

We can’t get insights from the car owner information since they belong to different employment type. But they stayed at the same location for similar time period. They are very likely to meet each other and do the same thing. Besides, the consumption at 3 o’clock came from 3 credit cards and this meetup also involved in 3 persons.

Thus, one possible explanation of the consumption at mid-night could be that the three car owners came to discuss some plans on day 18 and met again at 3 o’clock on day 19.

Another possible explanation direction could be persons just stayed near the mart, so they don’t need to drive and walked there to make consumption. Or the three person use other vehicles, not from the company, to reach the mart and make consumption.

This suspicious activities will be analyzed further in question 5. It might need to check the behaviors of the three car owners in the 14 days.

Discrepancies between data
cc_num <- length(unique(cc$last4ccnum))
loyalty_num <- length(unique(loyalty$loyaltynum))
ppl_num <- length(car_assignments$LastName)
c(cc_num, loyalty_num, ppl_num)
[1] 55 54 44

We can find that there are 44 employees, but 55 credit cards and 54 loyalty cards. If we suppose no errors in the card id, one employee has one or more credit cards and one or more loyalty cards.


Interactive graphs make it difficult to load pages, so this study is divided into two blogs.

To be continued with Part 2